Preliminary Analysis

#Load data
import pandas as pd

df = pd.read_csv("source_data.csv", sep = ",")
#Get dimensions and look at data head
print(df.shape)
df.head()
(7917, 12)
trip_id drop_sequence distance_calculated distance_covered_till_drop origin_warehouse_code destination_warehouse_code total_quantity actual_transit_time_from_cnf_minutes trip_created_date trip_start_time geofence_in_time geofence_out_time
0 23297 1 11.0 11.0 C220 25001891 2 12.875800 2023-02-01 2023-02-01T10:50:06.452 2023-02-01T11:02:59 2023-02-01T11:15:23
1 23297 2 16.0 27.0 C220 10015094 30 92.809133 2023-02-01 2023-02-01T10:50:06.452 2023-02-01T12:22:55 2023-02-01T12:36:09.980
2 23298 1 14.0 14.0 C220 10003303 65 41.280700 2023-02-01 2023-02-01T10:52:08.158 2023-02-01T11:33:25 2023-02-01T11:56:15.964
3 23372 1 14.0 14.0 C220 10003303 100 17.953983 2023-02-01 2023-02-01T11:26:10.013 2023-02-01T11:44:07.252 2023-02-01T12:06:06.993
4 23595 1 18.0 18.0 C220 10013469 24 63.004467 2023-02-01 2023-02-01T13:18:07.732 2023-02-01T14:21:08 2023-02-01T14:35:58
#Get dataset structure, datatypes and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7917 entries, 0 to 7916
Data columns (total 12 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   trip_id                               7917 non-null   int64  
 1   drop_sequence                         7917 non-null   int64  
 2   distance_calculated                   7887 non-null   float64
 3   distance_covered_till_drop            7904 non-null   float64
 4   origin_warehouse_code                 7917 non-null   object 
 5   destination_warehouse_code            7917 non-null   int64  
 6   total_quantity                        7917 non-null   int64  
 7   actual_transit_time_from_cnf_minutes  7887 non-null   float64
 8   trip_created_date                     7917 non-null   object 
 9   trip_start_time                       7907 non-null   object 
 10  geofence_in_time                      7897 non-null   object 
 11  geofence_out_time                     7854 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 742.3+ KB

Data Dictionary

Column Name Column Description
trip_id Unique Identifier for trip
drop_sequence Drop Sequence of Milk Run of the particular trip
distance_calculated distance from the Depot ( Origin )
distance_covered_till_drop distance covered from the Depot ( Origin ) to the Destination
origin_warehouse_code Unique code for Origin Depot/Warehouse
destination_warehouse_code Destination Dealer/Warehouse Code
total_quantity Total Quantity of goods delivered to particular Destination Warehouse/Dealer
actual_transit_time_from_cnf_minutes Transit Time taken in minutes to be delivered to particular Destination Warehouse/Dealer
trip_created_date Trip Date
trip_start_time Start Time of the trip
geofence_in_time Geofence In Time to the destination Warehouse/Dealer
geofence_out_time Geofence Out Time to the destination Warehouse/Dealer

Data Preparation

Dealing with datatypes

#Get column info, analyze datatypes and missing values
print(df.dtypes)
trip_id                                   int64
drop_sequence                             int64
distance_calculated                     float64
distance_covered_till_drop              float64
origin_warehouse_code                    object
destination_warehouse_code                int64
total_quantity                            int64
actual_transit_time_from_cnf_minutes    float64
trip_created_date                        object
trip_start_time                          object
geofence_in_time                         object
geofence_out_time                        object
dtype: object
df["trip_id"] = df["trip_id"].astype('category')
df["origin_warehouse_code"] = df["origin_warehouse_code"].astype('category')
df["destination_warehouse_code"] = df["destination_warehouse_code"].astype('category')

df["total_quantity"] = df["total_quantity"].astype('float')

df["trip_created_date"] = pd.to_datetime(df["trip_created_date"])
df["trip_start_time"] = pd.to_datetime(df["trip_start_time"])
df["geofence_in_time"] = pd.to_datetime(df["geofence_in_time"])
df["geofence_out_time"] = pd.to_datetime(df["geofence_out_time"])
#Confirm datatype changes
print(df.dtypes)
trip_id                                       category
drop_sequence                                    int64
distance_calculated                            float64
distance_covered_till_drop                     float64
origin_warehouse_code                         category
destination_warehouse_code                    category
total_quantity                                 float64
actual_transit_time_from_cnf_minutes           float64
trip_created_date                       datetime64[ns]
trip_start_time                         datetime64[ns]
geofence_in_time                        datetime64[ns]
geofence_out_time                       datetime64[ns]
dtype: object

Dealing with Missing Data

#determine missing data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7917 entries, 0 to 7916
Data columns (total 12 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   trip_id                               7917 non-null   category      
 1   drop_sequence                         7917 non-null   int64         
 2   distance_calculated                   7887 non-null   float64       
 3   distance_covered_till_drop            7904 non-null   float64       
 4   origin_warehouse_code                 7917 non-null   category      
 5   destination_warehouse_code            7917 non-null   category      
 6   total_quantity                        7917 non-null   float64       
 7   actual_transit_time_from_cnf_minutes  7887 non-null   float64       
 8   trip_created_date                     7917 non-null   datetime64[ns]
 9   trip_start_time                       7907 non-null   datetime64[ns]
 10  geofence_in_time                      7897 non-null   datetime64[ns]
 11  geofence_out_time                     7854 non-null   datetime64[ns]
dtypes: category(3), datetime64[ns](4), float64(4), int64(1)
memory usage: 679.5 KB

mention columns with missing data

#Get percentage values of missing data
print(f"Total Number of rows: {len(df)}")
print(f"Total number of null values: {df.isnull().sum().sum()}")

print(100*(df.isnull().sum())/len(df))
Total Number of rows: 7917
Total number of null values: 166
trip_id                                 0.000000
drop_sequence                           0.000000
distance_calculated                     0.378931
distance_covered_till_drop              0.164204
origin_warehouse_code                   0.000000
destination_warehouse_code              0.000000
total_quantity                          0.000000
actual_transit_time_from_cnf_minutes    0.378931
trip_created_date                       0.000000
trip_start_time                         0.126310
geofence_in_time                        0.252621
geofence_out_time                       0.795756
dtype: float64

Since less that 0.5% is missing in most cases, safe to remove. Time constraints mean that methods like exploration for randomness, and possibly imputation cannot be performed.

Geofence out time will have to be dealt with since it has the largest contingent of missing values. To potentially save ourself from removing unnecessary rows, especially if this column is unimportant to the model, we'll have to figure out a way to use the column with missing data.

See if correaltions with columns with missing data so we can remove rows with confidence.

#visualize missing data for patterns
import missingno as msno

# msno.matrix(df, figsize=(10, 150))
msno.matrix(df)
<Axes: >
#Remove rows with any Null values
df = df.dropna(how='any',axis=0) 

Dealing with Outliers

Conduct tests for simple outliers. (i.e. distance and time cannot be 0 or negative) Check for very high or low values. No time for in multivariate outlier detection or in depth analysis.

print(len(df[df["drop_sequence"] <= 0]))
print(len(df[df["distance_calculated"] <= 0]))
print(len(df[df["distance_covered_till_drop"] <= 0]))
print(len(df[df["total_quantity"] <= 0]))
print(len(df[df["actual_transit_time_from_cnf_minutes"] <= 0]))
0
425
46
3
45
#Only save data points with numeric value greater than 0

df = df[df["drop_sequence"] > 0]
df = df[df["distance_calculated"] > 0]
df = df[df["distance_covered_till_drop"] > 0]
df = df[df["total_quantity"] > 0]
df = df[df["actual_transit_time_from_cnf_minutes"] > 0]
#Drop values on either side of interquatile range
cols = ['drop_sequence', 'distance_calculated','distance_covered_till_drop', 'total_quantity', 'actual_transit_time_from_cnf_minutes']

Q1 = df[cols].quantile(0.25)
Q3 = df[cols].quantile(0.75)
IQR = Q3 - Q1

df = df[~((df[cols] < (Q1 - 1.5 * IQR)) |(df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
print(len(df))
6156

Basic statistics and Cardinality

Descriptive statistics

df.describe().T
count mean std min 25% 50% 75% max
drop_sequence 6156.0 3.966537 2.609957 1.000000 2.000000 3.000000 6.000000 12.0000
distance_calculated 6156.0 5.398635 4.661921 1.000000 2.000000 4.000000 7.000000 19.0000
distance_covered_till_drop 6156.0 20.154483 13.155102 1.000000 10.000000 18.000000 29.000000 62.0000
total_quantity 6156.0 8.734893 6.970721 1.000000 4.000000 7.000000 12.000000 31.0000
actual_transit_time_from_cnf_minutes 6156.0 104.031281 65.319121 0.019417 52.026633 98.752883 149.091333 305.8469

Understanding cardinality

Check percentage of unique values for categorical. See unique values for understanding.

print(df['origin_warehouse_code'].value_counts())
C220    6156
Name: origin_warehouse_code, dtype: int64

We can see that the origin warehouse is the same across all datapoints.

print(df['destination_warehouse_code'].nunique())
134

We can see that the destination warehouse spans 134 different values.

print(df['destination_warehouse_code'].value_counts())
10004333    189
10013857    182
10013792    172
25001972    169
10002797    169
           ... 
25003515      0
10009653      0
25003677      0
25002004      0
25003877      0
Name: destination_warehouse_code, Length: 140, dtype: int64

Feature Engineering

Splitting time

Subsequent drops in the sequence uses the previous drop's geofence out time to get accurate readings of trip hour.

df[["year", "month", "m_day", "hour"]] = df["geofence_out_time"].shift(1).apply(lambda x: x.timetuple()[0:4] if pd.notna(x) else (2023, 2, 1, 0)).tolist()
df.loc[(df.drop_sequence == 1), ["year", "month", "m_day", "hour"]] = df.loc[(df.drop_sequence == 1), "trip_start_time"].apply(lambda x: x.timetuple()[0:4]).tolist()

df["w_day"] = df["geofence_out_time"].shift(1).apply(lambda x: x.weekday() if pd.notna(x) else 0)
df.loc[(df.drop_sequence == 1), "w_day"] = df.loc[(df.drop_sequence == 1), "trip_start_time"].apply(lambda x: x.weekday())

Univariate Analysis

Target: actual_transit_time

df["actual_transit_time_from_cnf_minutes"].describe()
count    6156.000000
mean      104.031281
std        65.319121
min         0.019417
25%        52.026633
50%        98.752883
75%       149.091333
max       305.846900
Name: actual_transit_time_from_cnf_minutes, dtype: float64
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df, x = 'actual_transit_time_from_cnf_minutes')
plt.xlim(df["actual_transit_time_from_cnf_minutes"].min(), df["actual_transit_time_from_cnf_minutes"].max())
(0.019416667, 305.8469)
from distfit import distfit
import numpy as np

dist = distfit()
dist.fit_transform(df["actual_transit_time_from_cnf_minutes"], 0)
dist.plot(figsize=(5,3))
(<Figure size 360x216 with 1 Axes>,
 <Axes: title={'center': '\nbeta(a=1.17946, b=2.36291, loc=0.00616211, scale=307.497)'}, xlabel='Values', ylabel='Frequency'>)

We can observe that the target conforms to a beta distribution with a = 1.18 and b = 2.36.

Drop Sequence

sns.histplot(df, x = 'drop_sequence')
plt.xlim(df["drop_sequence"].min(), df["drop_sequence"].max())
(1.0, 12.0)

Sequence drops are concentrated in the lower values.

distance_calculated


Represents the distance travelled in each trip.

df["distance_calculated"].describe()
count    6156.000000
mean        5.398635
std         4.661921
min         1.000000
25%         2.000000
50%         4.000000
75%         7.000000
max        19.000000
Name: distance_calculated, dtype: float64
sns.histplot(df, x = 'distance_calculated')
plt.xlim(df["distance_calculated"].min(), df["distance_calculated"].max())
(1.0, 19.0)

distance_covered_till_drop

Represents the distance travelled in each journey.

df["distance_covered_till_drop"].describe()
count    6156.000000
mean       20.154483
std        13.155102
min         1.000000
25%        10.000000
50%        18.000000
75%        29.000000
max        62.000000
Name: distance_covered_till_drop, dtype: float64
sns.histplot(df, x = 'distance_covered_till_drop')
plt.xlim(df["distance_covered_till_drop"].min(), df["distance_covered_till_drop"].max())
(1.0, 62.0)

total_quantity

df["total_quantity"].describe()
count    6156.000000
mean        8.734893
std         6.970721
min         1.000000
25%         4.000000
50%         7.000000
75%        12.000000
max        31.000000
Name: total_quantity, dtype: float64
sns.histplot(df, x = 'total_quantity')
plt.xlim(df["total_quantity"].min(), df["total_quantity"].max())
(1.0, 31.0)

Day of Month

df[["year", "month", "m_day", "hour"]]
df["m_day"].describe()
count    6156.000000
mean       15.657895
std         8.641572
min         1.000000
25%         8.000000
50%        15.000000
75%        23.000000
max        31.000000
Name: m_day, dtype: float64
sns.histplot(df, x = 'm_day')
plt.xlim(df["m_day"].min(), df["m_day"].max())
(1.0, 31.0)

Can observe the drop when approaching the weekend.

Day of Week

df['day_name'] = df['trip_start_time'].dt.day_name()
sns.histplot(df, x = 'day_name')
<Axes: xlabel='day_name', ylabel='Count'>

Hour

df["hour"].describe()
count    6156.000000
mean       14.890026
std         2.222788
min         0.000000
25%        13.000000
50%        15.000000
75%        17.000000
max        23.000000
Name: hour, dtype: float64
sns.histplot(df, x = 'hour')
plt.xlim(df["hour"].min(), df["hour"].max())
(0.0, 23.0)

Month

sns.histplot(df, x = 'month')
plt.xlim(df["month"].min(), df["month"].max())
(2.0, 9.0)

Bivariate Analysis

Between distance and time

plt.scatter(df["distance_covered_till_drop"], df["actual_transit_time_from_cnf_minutes"])
plt.show()

Analyzing trend of actual_transit_time

plt.plot(df["trip_start_time"], df["actual_transit_time_from_cnf_minutes"])
[<matplotlib.lines.Line2D at 0x27e8e6ab760>]

No appreciable pattern can be derived from the line graph.

Analyzing trend of distance_covered

plt.plot(df["trip_start_time"], df["distance_covered_till_drop"])
[<matplotlib.lines.Line2D at 0x27e8e723610>]

No appreciable pattern can be derived from the line graph.

Between quantity and time

plt.scatter(df["total_quantity"], df["actual_transit_time_from_cnf_minutes"])
plt.show()

No viable correlation can be determined between total_quantity and actual_transit_time.

Correlation Analysis

corr_matrix = df.drop(['year'], axis=1).corr()
sns.heatmap(corr_matrix, annot=True)
plt.show()